
    Per Aspera ad Astra: On the Way to Parallel Processing

    Computational Science and Engineering is being established as a third category of scientific methodology; this innovative discipline supports and supplements the traditional categories, theory and experiment, in order to solve the problems arising from the complex systems challenging science and technology. While the successes of the past two decades in scientific computing have been achieved essentially through the technical breakthrough of the vector supercomputers, today the discussion about the future of supercomputing is focused on massively parallel computers. The discrepancy, however, between peak performance and the sustained performance achievable with algorithmic kernels, software packages, and real applications is still disappointingly high. An important issue is programming models. While Message Passing on parallel computers with distributed memory is the only efficient programming paradigm available today, from a user's point of view it is hard to imagine that this programming model, rather than Shared Virtual Memory, will be capable of serving as the central basis for bringing computing on massively parallel systems from a sheer computer-science trend to the technological breakthrough needed to deal with the large applications of the future. This is especially true for commercial applications, where explicitly programming the data communication via Message Passing may turn out to be a huge software-technological barrier that nobody is willing to surmount.

    KFA Jülich is one of the largest big-science research centres in Europe; its scientific and engineering activities range from fundamental research to applied science and technology. KFA's Central Institute for Applied Mathematics (ZAM) runs the large-scale computing facilities and network systems at KFA and provides communication services as well as general-purpose and supercomputer capacity, also to the HLRZ ("Höchstleistungsrechenzentrum"), established in 1987 in order to further enhance and promote computational science in Germany. Thus, at KFA, and in particular driven by ZAM, supercomputing has received high priority for more than ten years. What particle accelerators mean to experimental physics, supercomputers mean to Computational Science and Engineering: supercomputers are the accelerators of theory.
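
    The software-technological barrier described above comes from having to orchestrate every data transfer by hand. A minimal message-passing sketch, written here with the mpi4py binding (an assumption, not part of the original text; any MPI binding looks similar), makes that explicitness visible:

        # Minimal point-to-point message passing; run with: mpiexec -n 2 python ping.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        if rank == 0:
            data = [1.0, 2.0, 3.0]
            comm.send(data, dest=1, tag=11)     # sender must name the receiver and a tag
        elif rank == 1:
            data = comm.recv(source=0, tag=11)  # receiver must match both explicitly
            print("rank 1 received", data)

    Every transfer names its partner, tag, and payload; under a Shared Virtual Memory model the runtime would move the data implicitly, which is why the abstract expects that model to be the more palatable basis for commercial applications.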

Grid Computing

    "Grid-Computing", ein Mitte der 90er Jahre eingeführter Begriff, bezeichnet eine Architektur für verteilte Systeme, die auf dem World Wide Web aufbaut und die Web-Vision erweitert. Mit dem Grid-Computing werden die Ressourcen einer Gemeinschaft, einer sogenannten “virtuellen Organisation” (siehe unten), integriert. Die Hoffnung ist, dass hierdurch rechen- und/oder datenintensiven Aufgaben, die eine einzelne Organisation nicht lösen kann, handhabbar werden. Ein “Grid” bezeichnet eine nach dem Grid-Computing-Ansatz aufgebaute Rechner-, Netzwerk- und Software-Infrastruktur zur Teilung von Ressourcen mit dem Ziel, die Aufgaben einer virtuellen Organisation zu erledigen. Zu Beginn war die Möglichkeit, ungenutzte CPU-Ressourcen an anderen Stellen für die eigenen Aufgaben einzusetzen, die wesentlich treibende Kraft für erste Experimente. Internet-Computing-Projekte wie SETI@Home, distributed.net u.a., bei denen die unbenutzten Rechenzyklen von weltweit verteilten privaten PCs verwendet werden, illustrieren das Potential des Grid-Computing. Die heutigen Grid-Konzepte und die ersten -Prototypen gehen weit über diese Anfänge hinaus. Sie versprechen die transparente Bereitstellung von Diensten unabhängig von der räumlichen Nähe. Es wird erwartet, dass das Grid-Computing die Nutzung von Rechnern und Rechnernetzen so grundlegend verändern wird, wie das Web den Datenaustausch bereits verändert hat

    Turning Privacy Constraints into Syslog Analysis Advantage

    Nowadays, failures in high performance computers (HPC) have become the norm rather than the exception [10]. In the near future, the mean time between failures (MTBF) of HPC systems is expected to become so short that current failure recovery mechanisms, e.g., checkpoint-restart, will no longer be able to recover the systems from failures [1]. Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. Detecting failures in their early stage can reduce their negative effects by preventing their propagation to other parts of the system [3]. The goal of the current work is to contribute to the foundation of failure detection techniques by sharing ongoing research with the community. Herein we consider user privacy the main priority and then turn the constraint applied to protect user privacy into an advantage for analyzing system behavior. We use de-identification, constantification, and hashing to reach this goal. Our approach also contributes to the reproducibility and openness of future research in the field. Via this approach, system administrators can share their syslogs with the public domain without concern.
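
    The abstract names de-identification and hashing as the anonymization steps. A minimal sketch of what salted hashing of sensitive syslog fields might look like (the field patterns and salt handling are assumptions for illustration, not the paper's actual pipeline):

        # Sketch: replace usernames and hostnames in syslog lines with salted hashes,
        # so identical identifiers map to identical tokens (analysis stays possible)
        # while the original values are not recoverable without the salt.
        import hashlib
        import re

        SALT = b"site-secret-salt"  # hypothetical per-site secret, kept private

        def pseudonymize(value: str) -> str:
            digest = hashlib.sha256(SALT + value.encode()).hexdigest()
            return digest[:12]  # short, stable token

        # Toy patterns; a real pipeline would cover many more field types.
        USER_RE = re.compile(r"user=(\w+)")
        HOST_RE = re.compile(r"host=([\w.-]+)")

        def deidentify(line: str) -> str:
            line = USER_RE.sub(lambda m: f"user={pseudonymize(m.group(1))}", line)
            line = HOST_RE.sub(lambda m: f"host={pseudonymize(m.group(1))}", line)
            return line

        print(deidentify("sshd: failed login user=alice host=node042.cluster"))

    Because the mapping is deterministic, repeated failures of the same (pseudonymized) node remain correlatable, which is the "advantage" the title refers to.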

Performance Analysis of Parallel Programs: The PARvis Visualization Environment

    PARvis is a visualization environment that translates a given trace file into a variety of graphical views, e.g., snapshots, statistics, or timeline displays. This eases program optimization and thereby significantly shortens the development cycle on massively parallel computer systems. PARvis supports the common programming models (physically/virtually shared memory, message passing) and runs on a wide range of workstations.

    Linux Cluster in Theory and Practice: A Novel Approach in Teaching Cluster Computing Based on the Intel Atom Platform

    Current trends and studies on future architectures show that the complexity of parallel computer systems is increasing steadily. Hence, industry requires skilled employees who have, in addition to the theoretical fundamentals, practical experience in the design and administration of such systems. However, investigations have shown that practical approaches are still missing from current curricula, especially in these areas. For this reason, the Chair of Computer Architecture at the Faculty of Computer Science at Technische Universität Dresden developed and introduced the course "Linux Cluster in Theory and Practice" (LCTP). The main objectives of this course are to provide background knowledge about the design and administration of large-scale parallel computer systems together with the practical implementation on the available hardware. In addition, students learn how to solve problems in a structured way and as part of a team. This paper analyzes the current variety of courses in the area of parallel computing systems, describes the structure and implementation of LCTP, and provides first conclusions and an outlook on possible further developments.

    Detecting Memory-Boundedness with Hardware Performance Counters

    Modern processors incorporate several performance monitoring units, which can be used to count events that occur within different components of the processor. They provide access to information on hardware resource usage and can therefore be used to detect performance bottlenecks. Thus, many performance measurement tools are able to record them alongside information about the application behavior. However, the exact meaning of the supported hardware events is often incomprehensible due to the system complexity and partially lacking or even inaccurate documentation. For most events it is also not documented whether a certain rate indicates saturated resource usage. Therefore, it is usually difficult to draw conclusions about the performance impact from the observed event rates. In this paper, we evaluate whether hardware performance counters can be used to measure the capacity utilization within the memory hierarchy and to estimate the impact of memory accesses on the achieved performance. The presented approach is based on a small selection of micro-benchmarks that constantly stress individual components in the memory subsystem, ranging from caches to main memory. These workloads are used to identify hardware performance counters that provide good estimates of the utilization of individual components in the memory hierarchy. However, since access latencies can be interleaved with computing instructions, a high utilization of the memory hierarchy does not necessarily result in low performance. We therefore also investigate which stall counters provide good estimates of the number of cycles that are actually spent waiting for the memory hierarchy.
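
    The approach can be read as a calibration step: the micro-benchmarks establish the counter rate at which a component is saturated, and an application's measured rate is then normalized against that peak. A toy sketch with invented counter names and rates (the paper's actual counters and thresholds are not reproduced here):

        # Toy calibration: saturation rates (events/s) measured with micro-benchmarks
        # that fully stress one memory-hierarchy component at a time. All names and
        # numbers are invented for illustration.
        SATURATION_RATE = {
            "l1_hits":    2.0e9,
            "l3_misses":  1.5e8,
            "dram_reads": 9.0e7,
        }

        def utilization(app_rates: dict) -> dict:
            """Estimate per-component capacity utilization of an application run."""
            return {c: min(app_rates.get(c, 0.0) / peak, 1.0)
                    for c, peak in SATURATION_RATE.items()}

        # Event rates observed while running the application under study (invented).
        app = {"l1_hits": 1.1e9, "l3_misses": 1.4e8, "dram_reads": 2.0e7}
        for component, u in utilization(app).items():
            print(f"{component}: {u:.0%} of measured saturation")

    The abstract's caveat applies to exactly this normalization: a component near its saturation rate is only a bottleneck if stall counters confirm the core was actually waiting on it.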

    Lessons learned from spatial and temporal correlation of node failures in high performance computers

    In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We identify possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping system administrators minimize (or even prevent) the destructive effects of correlated node failures.
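
    One simple way to surface the temporal and spatial correlations the study describes is to count failure pairs that fall within a short time window, and among those, the pairs that also share a location. A hypothetical sketch over an invented failure log (the window size, node names, and rack layout are illustrative assumptions):

        # Sketch: flag temporally close failure pairs and check whether they also
        # share a rack (spatial correlation). Log records are invented examples.
        from itertools import combinations

        WINDOW = 300  # seconds; pairs closer than this count as temporally correlated

        failures = [  # (unix_time, node, rack)
            (1000, "n017", "r01"), (1120, "n018", "r01"),
            (5000, "n233", "r07"), (5090, "n234", "r07"), (9000, "n101", "r03"),
        ]

        temporal, spatiotemporal = [], []
        for (t1, n1, r1), (t2, n2, r2) in combinations(sorted(failures), 2):
            if t2 - t1 <= WINDOW:
                temporal.append((n1, n2))
                if r1 == r2:
                    spatiotemporal.append((n1, n2))

        print("temporally correlated pairs:", temporal)
        print("also in the same rack:", spatiotemporal)

    A production study would of course control for coincidence (e.g., by comparing against shuffled timestamps); the pair-counting skeleton is the part sketched here.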

    Analysis of Node Failures in High Performance Computers Based on System Logs

    The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. In the near future, it is expected that the mean time between failures of HPC systems will become too short and that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is, thus, essential to prevent their destructive effects. Based on measurements of a production system at TU Dresden over an 8-month time period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study is achieving a clearer understanding of correlations between observed node failures and enabling failure detection as early as possible. The results are aimed at helping system administrators minimize (or prevent) the destructive effects of failures.

    HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms

    This work presents a new trace-based parallel discrete event simulation framework designed for predicting the behavior of a novel computing platform running energy-aware parallel applications. Discrete event traces capture the runtime behavior of parallel applications on existing systems and form the basis for the simulation. The simulation framework processes the events of the input trace by applying simulation models that modify event properties. Thus, the output is again an event trace that describes the predicted application behavior on the simulated target platform. Both input and simulated traces can be visualized and analyzed with established tools. The modular design of the framework enables the simulation of different aspects such as temporal performance and energy efficiency by applying distinct simulation models, e.g.: (i) a performance model for communication that allows evaluation of the target communication topology and link properties; (ii) an energy model for computations that is based on measurements of current hardware. We showcase the potential of this framework by simulating the execution of benchmark applications to explore design alternatives of highly adaptive and energy-efficient computing applications and platforms.
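
    The core mechanism, reading an event trace and emitting a modified trace by applying per-event models, can be pictured as a small pipeline. A hypothetical sketch (the event fields and the two toy models are invented for illustration and are not HAEC-SIM's actual interfaces):

        # Sketch of a trace-based simulation step: each input event passes through
        # models that rewrite its properties; the result is again an event trace.
        from dataclasses import dataclass, replace

        @dataclass
        class Event:
            time: float      # seconds since trace start
            kind: str        # "compute" or "send"
            duration: float  # seconds
            energy: float    # joules

        def comm_model(ev: Event) -> Event:
            # Hypothetical target link is 2x faster than the measured system.
            return replace(ev, duration=ev.duration / 2) if ev.kind == "send" else ev

        def energy_model(ev: Event) -> Event:
            # Hypothetical target cores draw 0.8x the energy in compute phases.
            return replace(ev, energy=ev.energy * 0.8) if ev.kind == "compute" else ev

        def simulate(trace, models):
            out = []
            for ev in trace:
                for model in models:
                    ev = model(ev)
                out.append(ev)
            return out  # a trace again, analyzable with the usual tools

        trace = [Event(0.0, "compute", 1.2, 60.0), Event(1.2, "send", 0.4, 5.0)]
        print(simulate(trace, [comm_model, energy_model]))

    A full simulator would additionally re-time downstream events when durations change; the sketch shows only the per-event model application that makes the framework modular.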

    VAMPIR: Visualization and Analysis of MPI Resources

    Performance analysis is most often based on detailed knowledge of program behavior. One option for obtaining this information is tracing. Based on the research tool PARvis, the visualization environment VAMPIR was developed at KFA; it now supports the new message passing standard MPI. VAMPIR translates a given trace file into a variety of graphical views, e.g., state diagrams, activity charts, time-line displays, and statistics. Moreover, it supports an animation mode that can help to locate performance bottlenecks, and it provides flexible filter operations to reduce the amount of information displayed. The most interesting part of VAMPIR is its powerful zooming feature, which makes it possible to identify problems at any level of detail.